Consider the following situation:
You're at the doctor's office, and the doc rolls up and introduces herself. As she explains, she's adopted a unique approach to diagnosing patients: for every patient, her diagnosis is that they don't have cancer, regardless of whatever symptoms they may exhibit.
You're confused and somewhat concerned, but she assures you she has a track record of 99.9% accuracy.
After some quick googling, you find out cancer occurs in less than 0.1% of the total population.
Do you stay or get the heck outta there?
The doc isn't lying. She may have a 99.9% accuracy, but her diagnosis method is completely useless.
Out of the patients who did have cancer, she didn't predict a single one of them right. You should probably find a new physician.
Accuracy isn't a great measure when it comes to prediction. We need something else: a set of metrics that describe how well the doctor picks up actually positive cases, or rules out actually negative ones.
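To see the problem concretely, here's a quick sketch of the doctor's always-negative strategy on a made-up population of 10,000 people (the 0.1% prevalence is assumed purely for illustration):
import numpy as np
# Hypothetical population: 10 true positives out of 10,000 (0.1% prevalence)
y_true = np.array([1] * 10 + [0] * 9990)
y_pred = np.zeros_like(y_true)  # The doctor always predicts "no cancer"
accuracy = (y_true == y_pred).mean()             # 0.999, looks impressive
sensitivity = (y_pred[y_true == 1] == 1).mean()  # 0.0, catches zero real cases
print(accuracy, sensitivity)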
Agenda for today: accuracy and its limits, confusion matrices, sensitivity and specificity, dummy baselines on imbalanced data, decision thresholds, and ROC curves with AUC.
We'll start with the usual imports, and set some options so we get consistent results when generating random data
# The usual imports
import pandas as pd
import numpy as np
# Print options and a NumPy random seed for consistent output
np.set_printoptions(precision=4, suppress=True)
np.random.seed(123)
# For visualization
import plotly.express as px
import plotly.graph_objects as go
# To model normal distribution
from scipy.stats import norm
# To make data
from sklearn.datasets import make_blobs
Let's make some data! We'll start with 5,000 observations.
n = 5000
Here we're using the make_blobs() function to create two classes with similar attributes.
centers = [[9.5, 0], [10.5, 0]] # Define the coordinates to center our blobs (x,y)
X, y = make_blobs(n_samples=n, centers=centers, cluster_std=0.4, random_state=7)
data = pd.DataFrame(X, columns=['feature1','feature2']) # Rename the feature cols
data['target'] = y.astype('str') # Convert dtype to help w/ viz
data.head()
How many observations are in each class?
print('Shape:', data.shape, '\n')
print('Class Frequencies:')
print(data.target.value_counts(normalize=True))
This 50-50 split means our target classes are balanced.
fig = px.scatter(data.tail(1000), x='feature1',y='feature2',
color='target', opacity=.6,
color_discrete_sequence = ['red','grey'],
category_orders={"target": ["1", "0"]},
labels={'target':'Target Class',
'feature1': 'Feature 1',
'feature2':'Feature 2'},
marginal_x='histogram',
marginal_y='histogram',
)
fig.show('notebook')
What patterns do you see?
My strategy: draw a boundary at x = 10, since this best separates the classes (as seen in the histograms).
fig = px.scatter(data.tail(1000), x='feature1',y='feature2',
color='target', opacity=.6,
color_discrete_sequence = ['red','grey'],
category_orders={"target": ["1", "0"]},
labels={'target':'Target Class',
'feature1': 'Feature 1',
'feature2':'Feature 2'},
marginal_x='histogram', range_x = [8.25,11.75],
)
fig.add_shape(dict(type="line", x0=10, y0=-1.5, x1=10, y1=1.5,
line=dict(color="black",width=2,dash="dash",)
))
fig.show('notebook')
One thing to note: As we move farther from the boundary, the certainty of our prediction gets stronger
In other words, the probability that a particular observation is of class 1 increases as you move further to the right
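For intuition, here's how that probability might be modeled with a normal CDF of the distance from the boundary (this mirrors the predict_proba method we'll define below, using the cluster_std of 0.4):
from scipy.stats import norm
boundary = 10
for x in [9.6, 10.0, 10.4, 10.8]:
    # Probability of class 1 grows as we move to the right of the boundary
    print(x, round(norm.cdf((x - boundary) / 0.4), 3))
# Roughly: 0.159, 0.5, 0.841, 0.977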
Let's use this boundary to make our own little classifier. Not from sklearn, just on our own!
You try! Create a predict() function. In other words: take in the test data (a dataframe), compare feature1 to the boundary, and predict 1 or 0 based on that. Make sure the result is an int dtype.
class BoundaryClassifier():
def __init__(self):
from scipy.stats import norm
self.name = 'Classify observations on 1D boundary'
def fit(self, X_train, y_train, x_boundary=None):
self.boundary = x_boundary
def predict(self, X_test):
b = self.boundary
x = X_test.feature1
# You try!
y_pred = (x > b).astype(int)
return y_pred
def predict_proba(self, X_test):
b = self.boundary
x = X_test.feature1
# Use the normal distribution to model probabilities
y_pred_proba = ((x-b)/0.4).apply(norm.cdf)
return y_pred_proba
As usual, we'll split up our full dataset into training and testing sets
from sklearn.model_selection import train_test_split
X = data.drop(columns=['target'])
y = data.target.astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
Now, we use our little boundary classifier just as we would any sklearn model
clf = BoundaryClassifier() # Create the model
clf.fit(X_train, y_train, x_boundary = 10) # Fit it to the data
y_pred = clf.predict(X_test) # Predict classes
y_pred_proba = clf.predict_proba(X_test) # Predict the probability of falling into class 1
Let's take a look at some of the predictions by putting them side by side
test_results = pd.DataFrame(data = {'Actual Class':y_test, 'Predicted Class':y_pred, 'Predicted Probability':y_pred_proba})
test_results.sample(5)
This makes sense: for most of our observations, the predicted class matches the actual one.
Note that the predicted probability corresponds with the predicted class. A more extreme probability (closer to 0 or 1) means the data point was farther from the boundary, so we're fairly certain which class it belongs to.
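To see this, we can peek at the most extreme probabilities on either end; these should be the predictions we're most certain about:
# The most confident class-0 and class-1 predictions sit at the ends of the probability range
print(test_results.sort_values('Predicted Probability').head(3))
print(test_results.sort_values('Predicted Probability').tail(3))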
Question: How many test observations did we get right? In other words, what's the accuracy of our model?
# You try!
acc = len(test_results[test_results['Actual Class'] == test_results['Predicted Class']]) / len(test_results)
acc
# or
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
acc.round(4)
Question: Of the actually positive observations, how many did we correctly predict as positive?
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
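For binary labels, sklearn lays the matrix out as [[tn, fp], [fn, tp]], so you can also unpack the four counts directly; this is the same ordering the helper below relies on:
# Unpack the counts straight from sklearn's confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('TN:', tn, ' FP:', fp, ' FN:', fn, ' TP:', tp)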
Here's a quick helper function to label the confusion matrix
def custom_confusion_matrix(y_test_, y_pred_proba_, alpha=0.5, output='dataframe'):
"""
Usage:
cm = custom_confusion_matrix(y_test, y_pred_proba, output = 'dataframe')
tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output = 'rates')
Params:
alpha: Threshold probability for classification (default = 0.5)
output: One of 'dataframe', 'rates' (raveled counts: tn, fp, fn, tp), or 'array'
"""
y_pred_ = (y_pred_proba_ > alpha).map({True:1,False:0})
cf_mat_ = confusion_matrix(y_test_, y_pred_)
if output == 'dataframe':
return pd.DataFrame(cf_mat_, columns=['Predicted 0', 'Predicted 1'], index=['Actual 0', 'Actual 1'])
elif output == 'rates':
return cf_mat_.ravel()
else:
return cf_mat_
cm = custom_confusion_matrix(y_test, y_pred_proba, output = 'dataframe')
cm
We can also grab the counts of each category
tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output = 'rates')
tn, fp, fn, tp
How would you calculate Sensitivity (aka True Positive Rate — TPR)?
# You try!
tpr = tp / (tp + fn); tpr.round(4)
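The same four counts give us the related metrics. As a quick sketch, sklearn's recall_score should agree with the hand-computed TPR:
from sklearn.metrics import recall_score, precision_score
tnr = tn / (tn + fp)   # Specificity: true negative rate
ppv = tp / (tp + fp)   # Precision: share of predicted positives that were right
print('Specificity:', round(tnr, 4), ' Precision:', round(ppv, 4))
print('Recall via sklearn:', round(recall_score(y_test, y_pred), 4))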
What does the following figure describe?
I.e., what metric is the ratio of red dots to all dots (red + gray) shown here?
fig = px.scatter(data.tail(1000), x='feature1',y='feature2',
color='target', opacity=.6,
color_discrete_sequence = ['red','grey'],
category_orders={"target": ["1", "0"]},
labels={'target':'Target Class',
'feature1': 'Feature 1',
'feature2':'Feature 2'},
marginal_x='histogram',
range_x = [10, 11.75],
)
fig.add_shape(dict(type="line", x0=10, y0=-1.5, x1=10, y1=1.5,
line=dict(color="black",width=4,dash="dash",)
))
fig.show('notebook')
We can visualize which datapoints were misclassified in the following plot
data['prediction'] = (data.feature1 > 10).astype(int).astype('str')
data['prediction_correct'] = (data.prediction == data.target).astype('str')
fig = px.scatter(data.iloc[y_test.index].sample(n=1000),
x='feature1',y='feature2', opacity=.5,
color='target', symbol='prediction_correct',
color_discrete_sequence = ['red','grey'],
category_orders={'target': ['1','0'],
'prediction_correct':['True','False']},
symbol_sequence=['circle','x'],
labels={'target':'Actual Class',
'feature1':'Feature 1',
'feature2':'Feature 2',
'prediction_correct':'Prediction Correct',
},
marginal_x='histogram', range_x = [8.25,11.75],
)
fig.add_shape(dict(type="line", x0=10, y0=-1.5, x1=10, y1=1.5,
line=dict(color="black",width=2,dash="dash")))
label_names= ['True Positive', 'True Positive',
'False Negative','False Negative',
'True Negative','True Negative',
'False Positive', 'False Positive']
for idx, obj in enumerate(fig.data):
obj['name'] = label_names[idx]
fig.update_layout(legend_title_text='Classification')
fig.show('notebook')
Now let's model the situation from the doctor's office, with a heavy class imbalance.
Here, we'll use a 95-5 split, with only 5% of the target labels being of the positive class
To create this new dataset, we'll resample the old data, selecting proportions that produce the desired balance
imbal = 0.05
class0 = data[data.target == '0'].sample(frac=1-imbal) # Take 95% of the former negatives
class1 = data[data.target == '1'].sample(frac=imbal) # Take 5% of the former positives
cols_to_keep = ['feature1','feature2','target']
data_new = pd.concat([class0,class1]).sample(frac=1)[cols_to_keep] # Combine and shuffle the new data
data_new.head()
Note that the resulting dataset has half the original observations (we threw out quite a bit of the positives)
print('Shape:', data_new.shape, '\n')
print('Class Frequencies:')
print(data_new.target.value_counts(normalize=True))
To explore the data, let's take a look at the histograms, both before and after
combined = pd.concat([data,data_new]) # Combine both datasets, then label the source
combined['dataset'] = np.concatenate([np.full(len(data),'50-50 Split'),np.full(len(data_new),'95-5 Split')])
fig = px.histogram(combined, x='feature1', color='target', barmode='overlay',
opacity = 0.6, facet_col='dataset',
category_orders={"target": ["1", "0"]},
color_discrete_sequence = ['red','grey'],
labels={'target':'Target Class',
'feature1': 'Feature 1',
'count':'Frequency',
'dataset':'Data Source'
},
range_x = [8.25, 11.75],
)
for i in [1,2]:
# Add the boundary line to each subplot
fig.add_shape(dict(type="line", x0=10, y0=0, x1=10, y1=150,
line=dict(color="black",width=2,dash="dash",)
),row=1, col=i)
fig.show('notebook')
To model our doctor's "evidence-blind" method, we can create what's known as a "dummy" classifier.
There are different dummy strategies, e.g. randomly predict 1 or 0 with a 50% chance each ('50-50'), always predict a class we specify ('hard-coded'), or always predict the highest-frequency class ('most-freq').
All of these ignore the feature data
class DummyClassifier():
"""
Example Usage:
clf = DummyClassifier(strategy='hard-coded')
clf.fit(X_train, y_train, value_to_predict=0)
clf.predict(X_test)
"""
def __init__(self, strategy):
self.name = 'Dummy Classifier'
self.strategy = strategy
def fit(self, X_train, y_train, value_to_predict=None):
if self.strategy == 'most-freq':
self.value = y_train.value_counts().index.values[0]
elif self.strategy == 'hard-coded':
self.value = value_to_predict
if self.strategy in ['most-freq','hard-coded']:
# Share of training labels outside the chosen class; predict_proba reports this as P(class 1) when always predicting 0
self.prior = 1 - y_train.value_counts(normalize=True)[self.value]
def predict(self, X_test):
if self.strategy in ['most-freq','hard-coded']:
return pd.Series(np.full(len(X_test), self.value, dtype=int), index=X_test.index) # Always predict the stored class
elif self.strategy == '50-50':
np.random.seed(123)
return pd.Series(np.random.uniform(0,1,len(X_test)).round(0), index=X_test.index)
else:
print('Wrong strategy defined!')
def predict_proba(self, X_test):
if self.strategy in ['hard-coded','most-freq']:
# Constant probability of class 1 for every observation
return pd.Series(np.full(len(X_test), self.prior, dtype=float), index=X_test.index)
elif self.strategy == '50-50':
np.random.seed(123)
return pd.Series(np.random.uniform(0,1,len(X_test)), index=X_test.index)
else:
print('Wrong strategy defined!')
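For reference, scikit-learn ships its own dummy baselines with comparable strategies. A minimal sketch, imported under an alias so it doesn't clash with the class above:
from sklearn.dummy import DummyClassifier as SkDummy
# 'constant' ~ our 'hard-coded', 'most_frequent' ~ 'most-freq', 'uniform' ~ '50-50'
sk_dummy = SkDummy(strategy='constant', constant=0)
sk_dummy.fit(X_train, y_train)
sk_dummy.predict(X_test)[:5]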
Create new train/test splits from our new data: We're just adding a _new suffix onto everything
from sklearn.model_selection import train_test_split
X_new = data_new.drop(columns=['target'])
y_new = data_new.target.astype('int')
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_new, y_new, test_size=0.4, random_state=48)
Now let's use the dummy classifier and specify which value to predict (0, the negative class) for each observation.
Use a _dummy suffix for the classifier and array outputs here
clf_dummy = DummyClassifier(strategy='hard-coded')
clf_dummy.fit(X_train_new, y_train_new, value_to_predict = 0)
y_pred_dummy = clf_dummy.predict(X_test_new)
y_pred_proba_dummy = clf_dummy.predict_proba(X_test_new)
We're gonna try another dummy model — but one that uses a strategy of predicting 50-50
Use a _rando suffix for the classifier and array outputs here
clf_rando = DummyClassifier(strategy='50-50')
clf_rando.fit(X_train_new, y_train_new, value_to_predict = 0)
y_pred_rando = clf_rando.predict(X_test_new)
y_pred_proba_rando = clf_rando.predict_proba(X_test_new)
Finally let's create a legitimate boundary classifier
Use a _bc suffix for the classifier and array outputs here
clf_bc = BoundaryClassifier() # Create the model
clf_bc.fit(X_train_new, y_train_new, x_boundary = 10) # Fit it to the data
y_pred_bc = clf_bc.predict(X_test_new) # Predict classes
y_pred_proba_bc = clf_bc.predict_proba(X_test_new) # Predict the probability of falling into class 1
Let's use our confusion matrix function from before to see the results
cm_dummy = custom_confusion_matrix(y_test_new, y_pred_proba_dummy, output = 'dataframe')
cm_dummy
acc_dummy = accuracy_score(y_test_new, y_pred_dummy); acc_dummy.round(4)
tn, fp, fn, tp = custom_confusion_matrix(y_test_new, y_pred_proba_dummy, output = 'rates')
tpr_dummy = tp / (tp + fn); tpr_dummy.round(4)
cm_rando = custom_confusion_matrix(y_test_new, y_pred_proba_rando, output = 'dataframe')
cm_rando
acc_rando = accuracy_score(y_test_new, y_pred_rando); acc_rando.round(4)
tn, fp, fn, tp = custom_confusion_matrix(y_test_new, y_pred_proba_rando, output = 'rates')
tpr_rando = tp / (tp + fn); tpr_rando.round(4)
cm_bc = custom_confusion_matrix(y_test_new, y_pred_proba_bc, output = 'dataframe')
cm_bc
acc_bc = accuracy_score(y_test_new, y_pred_bc); acc_bc.round(4)
tn, fp, fn, tp = custom_confusion_matrix(y_test_new, y_pred_proba_bc, output = 'rates')
tpr_bc = tp / (tp + fn); tpr_bc.round(4)
Let's go back to our boundary classifier on the original, balanced dataset.
In our code before, we implicitly said: "if the predicted probability of being 1 is greater than 0.5, go ahead and classify it as 1."
A more explicit function call:
cm = custom_confusion_matrix(y_test, y_pred_proba, output = 'dataframe', alpha=0.5)
cm
tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output = 'rates')
tpr = tp / (tp + fn); tpr.round(4)
What if we were to decrease the probability threshold?
I.e., if we lowered the evidence needed to call a particular observation a 1, would we now be predicting more or fewer 1s?
fig = px.scatter(data.iloc[y_test.index].sample(n=1000), x='feature1',y='feature2',
color='target', opacity=.6,
color_discrete_sequence = ['red','grey'],
category_orders={"target": ["1", "0"]},
labels={'target':'Target Class',
'feature1': 'Feature 1',
'feature2':'Feature 2'},
marginal_x='histogram', range_x = [8.25,11.75],
)
for step in np.arange(9, 11.2, 0.25):
fig.add_trace(go.Scatter(x=[step, step],
y=[-1.5, 1.5],
name="Decision Boundary",
mode="lines",
line=dict(color="black",width=2,dash="dash"),
visible=False))
fig.data[8].visible = True
steps = []
for i in range(4,len(fig.data)):
step_label = str(round(i/4,2)+8)
step = dict(
method="update",
args=[{"visible": [False] * len(fig.data)}], # Start with every trace hidden
label = step_label)
step["args"][0]["visible"][i] = True # Toggle i'th trace to "visible"
for q in range(0,4):
step["args"][0]["visible"][q] = True # Toggle data traces to "visible"
steps.append(step)
sliders = [dict(active=4,
currentvalue={"prefix": "Decision Boundary at x= "},
pad={"t": 50, 'b':10},
steps=steps)]
fig.update_layout(sliders=sliders)
fig.show('notebook')
a = 0.3
custom_confusion_matrix(y_test, y_pred_proba, output = 'dataframe', alpha=a)
Clearly, we see more observations are being moved into the Predicted 1 column. How does that impact sensitivity? What about specificity?
tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output = 'rates', alpha=a)
tpr = tp / (tp + fn); tpr.round(4)
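And to answer the specificity half of the question, the true negative rate at the same alpha:
# Specificity at the lowered threshold
tnr = tn / (tn + fp); tnr.round(4)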
You try! What would happen if we were to increase the probability threshold?
a = 0.7
custom_confusion_matrix(y_test, y_pred_proba, output = 'dataframe', alpha=a)
tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output = 'rates', alpha=a)
tpr = tp / (tp + fn); tpr.round(4)
If the cost of a false negative were greater than the cost of a false positive, i.e. missing a positive case was really bad, which alpha (of the ones above) would you choose? Why?
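One way to reason about it: attach hypothetical costs to each error type and compare the total cost across the thresholds we just tried (the 10:1 ratio below is only an assumption for illustration):
cost_fn, cost_fp = 10, 1   # Assume missing a positive is 10x worse than a false alarm
for a in [0.3, 0.5, 0.7]:
    tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output='rates', alpha=a)
    print('alpha =', a, ' total cost =', cost_fn * fn + cost_fp * fp)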
def make_roc(y_test, y_pred_proba, steps=200):
alpha_vals = np.linspace(0,1,steps+1)
tpr_lst = []
fpr_lst = []
for a in alpha_vals:
tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, alpha=a, output='rates')
tpr_lst.append(tp / (tp + fn))
fpr_lst.append(fp / (tn + fp))
roc = pd.DataFrame([alpha_vals, tpr_lst, fpr_lst], index=['alpha','tpr','fpr']).T
return roc
def draw_roc(roc_curve, title, margins=0.02, connect_dots=True):
fig = px.scatter(roc_curve, x='fpr', y='tpr', color='alpha',
range_x=[0-margins,1+margins], range_y=[0-margins,1+margins],
color_continuous_scale=px.colors.sequential.Burg,
labels={'tpr':'True Positive Rate',
'fpr':'False Positive Rate',
'alpha':'Alpha'
},
title = title,
width=540,height=500
)
fig.add_shape( # add a 45-degree line
type="line", line_color="gray", line_width=3, opacity=1, line_dash="dash",
x0=0, x1=1, xref="x", y0=0, y1=1, yref="y")
fig.update_yaxes(tick0=0, dtick=.2)
fig.update_xaxes(tick0=0, dtick=.2)
fig.update_xaxes(showline=True, linewidth=2, linecolor='gray')
fig.update_yaxes(showline=True, linewidth=2, linecolor='gray')
if connect_dots:
fig.add_trace(go.Scatter(x=roc_curve.fpr,
y=roc_curve.tpr,
mode="lines",
line=dict(color="gray",width=1),
showlegend=False
))
return fig
roc_bc = make_roc(y_test_new, y_pred_proba_bc)
roc_bc.head()
fig_bc = draw_roc(roc_bc, title='ROC of Boundary Classifier on 95-5 Dataset', margins=0.03)
fig_bc.show('notebook')
roc_rando = make_roc(y_test_new, y_pred_proba_rando)
fig_rando = draw_roc(roc_rando, title='ROC of 50% Rando Classifier on 95-5 Dataset', margins=0.03)
fig_rando.show('notebook')
As you can tell, the ROC is what allows us to really differentiate crappy models from good ones.
What does the 45-degree line tell us?
roc_dummy = make_roc(y_test_new, y_pred_proba_dummy,steps=2)
fig_dummy = draw_roc(roc_dummy, title='ROC of Dummy Classifier on 95-5 Dataset', margins=0.03)
fig_dummy.show('notebook')
roc = make_roc(y_test, y_pred_proba)
fig = draw_roc(roc, title='ROC of BoundaryClassifier on 50-50 Dataset', margins=0.03)
fig.show('notebook')
We can summarize the curve in a single metric: the area under the curve, or AUC (here, of the ROC curve).
We'll use the sklearn.metrics.auc function for this
from sklearn.metrics import auc
auc(roc.fpr, roc.tpr)
print('ROC AUC of BoundaryClassifier on 50-50 Split dataset:', auc(roc.fpr, roc.tpr).round(4))
print('ROC AUC of BoundaryClassifier on 95-5 Split dataset:', auc(roc_bc.fpr, roc_bc.tpr).round(4))
print('ROC AUC of RandoClassifier on 95-5 Split dataset:', auc(roc_rando.fpr, roc_rando.tpr).round(4))
print('ROC AUC of DummyClassifier on 95-5 Split dataset:', auc(roc_dummy.fpr, roc_dummy.tpr).round(4))
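As a sanity check, sklearn can compute the same quantity directly from the labels and predicted probabilities; it should land close to our hand-rolled curves (up to the threshold grid):
from sklearn.metrics import roc_auc_score
print('roc_auc_score, 50-50 split:', round(roc_auc_score(y_test, y_pred_proba), 4))
print('roc_auc_score, 95-5 split: ', round(roc_auc_score(y_test_new, y_pred_proba_bc), 4))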
Try creating either a KNN or Decision Tree classifier on the balanced (original) dataset.
You can use the existing train/test splits, but make sure to:
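As a starting point, here's a minimal sketch with KNN, assuming the original 50-50 splits (X_train, X_test, y_train, y_test) are still in scope; the n_neighbors value is an arbitrary choice for you to tune:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
y_pred_proba_knn = knn.predict_proba(X_test)[:, 1]   # Probability of class 1
print('Accuracy:', round(accuracy_score(y_test, y_pred_knn), 4))
print('ROC AUC :', round(roc_auc_score(y_test, y_pred_proba_knn), 4))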